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Abstract 

Bayesian optimization techniques have been 
successfully applied to robotics, planning, 
sensor placement, recommendation, advertis- 
ing, intelligent user interfaces and automatic 
algorithm configuration. Despite these suc- 
cesses, the approach is restricted to problems 
of moderate dimension, and several work- 
shops on Bayesian optimization have iden- 
tified its scaling to high-dimensions as one 
of the holy grails of the field. In this pa- 
per, we introduce a novel random embedding 
idea to attack this problem. The resulting 
Random EMbedding Bayesian Optimization 
(REMBO) algorithm is very simple, has im- 
portant invar iance properties, and applies to 
domains with both categorical and continu- 
ous variables. We present a thorough theo- 
retical analysis of REMBO, including regret 
bounds that only depend on the problem's in- 
trinsic dimensionality. Empirical results con- 
firm that REMBO can effectively solve prob- 
lems with billions of dimensions, provided the 
intrinsic dimensionality is low. They also 
show that REMBO achieves state-of-the-art 
performance in optimizing the 47 discrete pa- 
rameters of a popular mixed integer linear 
programming solver. 
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1. Introduction 

Let / : A' ^ R be a function on a compact subset 
X C R^. We address the following global optimization 
problem 

x"^ = argmax/(x). 

We are particularly interested in objective functions / 
that may satisfy one or more of the following criteria: 
they do not have a closed-form expression, are expen- 
sive to evaluate, do not have easily available deriva- 
tives, or are non-convex. We treat / as a blackbox 
function that only allows us to query its function value 
at arbitrary x G X. To address objectives of this chal- 
lenging nature, we adopt the Bayesian optimization 
framework. There is a rich literature on Bayesian op- 
timization, and we refer readers unfamiliar with the 
topic to more tutorial treatments (Brochu et al., 2009; 
Jones et al., 1998; Jones, 2001; Lizotte et al., 2011; 
Mockus, 1994; Osborne et al., 2009) and recent the- 
oretical results (Srinivas et al., 2010; Bull, 2011; de 
Freitas et al., 2012). 

Bayesian optimization has two ingredients. The first 
ingredient is a prior distribution that captures our be- 
liefs about the behavior of the unknown objective func- 
tion. The second ingredient is a risk function that de- 
scribes the deviation from the global optimum. The 
expected risk is used to decide where to sample next. 
After observing a few samples of the objective, the 
prior is updated to produce a more informative poste- 
rior distribution over the space of objective functions 
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(see Figure 1 in Brochu et al., 2009). One problem 
with this maximum expected utility framework is that 
the risk is typically very hard to compute. This has 
led to the introduction of many sequential heuristics, 
known as acquisition functions, including Thompson 
sampling (Thompson, 1933), probability of improve- 
ment (Jones, 2001), expected improvement (Mockus, 
1994) and upper-confidence-bounds (Srinivas et al., 
2010). These acquisition functions trade-off explo- 
ration and exploitation. They are typically optimized 
by choosing points where the predictive mean is high 
(exploitation) and where the variance is large (explo- 
ration). Since the acquisition functions above have 
an analytical expression that is easy to evaluate, they 
are much easier to optimize than the original objective 
function, using off-the-shelf numerical optimization al- 
gorithms. It is also possible to use dynamic portfolios 
of acquisition functions to improve the efficiency of the 
method (Hoffman et al., 2011). 

The term Bayesian optimization seems to have been 
coined several decades ago by Jonas Mockus (1982). 
A popular version of the method has been known as 
efficient global optimization in the experimental design 
literature since the 1990s (Jones et al., 1998). Often, 
the approximation of the objective function is obtained 
using Gaussian process (GP) priors. For this reason, 
the technique is also referred to as GP bandits (Srini- 
vas et al., 2010). However, many other approximations 
of the objective have been proposed, including Parzen 
estimators (Bergstra et al., 2011), Bayesian paramet- 
ric models (Wang & de Freitas, 2011), treed GPs (Gra- 
macy et al., 2004) and random forests (Brochu et al., 
2009; Hutter, 2009). These may be more suitable 
than GPs when the number of iterations grows without 
bound, or when the objective function is believed to 
have discontinuities. We also note that often assump- 
tions on the smoothness of the objective function are 
encoded without use of the Bayesian paradigm, while 
leading to similar algorithms and theoretical guaran- 
tees (see, for example, Bubeck et al., 2011, and the 
references therein). 

In recent years, the machine learning community has 
increasingly used Bayesian optimization (Rasmussen, 
2003; Brochu et al., 2007; Martinez-Cantin et al., 2007; 
Lizotte et al., 2007; Frazier et al., 2009; Azimi et al., 
2010; Hamze et al., 2011; Azimi et al., 2011; Bergstra 
et al., 2011; Gramacy & Poison, 2011; Denil et al., 
2012; Mahendran et al., 2012; Azimi et al., 2012; Hen- 
nig & Schuler, 2012; Marchant & Ramos, 2012). De- 
spite many success stories, the approach is restricted 
to problems of moderate dimension, typically up to 
about 10; see for example the excellent and very recent 
overview by Snoek et al. (2012). Of course, for a great 



many problems this is all that is needed. However, 
to advance the state of the art, we need to scale the 
methodology to high-dimensional parameter spaces. 
This is the goal of this paper. 

It is difficult to scale Bayesian optimization to high di- 
mensions. To ensure that a global optimum is found, 
we require good coverage of A', but as the dimension- 
ality increases, the number of evaluations needed to 
cover X increases exponentially. As a result, there has 
been little progress on this challenging problem, with 
a few exceptions. Hutter et al. (2011) used random 
forests models in Bayesian optimization to achieve 
state-of-the-art performance in optimizing up to 76 
mixed discrete/continuous parameters of algorithms 
for solving hard combinatorial problems. However, 
their method is based on frequentist uncertainty es- 
timates that can fail even for the optimization of very 
simple functions and lacks theoretical guarantees. 

In the linear bandits case, Carpentier & Munos (2012) 
recently proposed a compressed sensing strategy to at- 
tack problems with a high degree of sparsity. Also re- 
cently, Chen et al. (2012) made significant progress by 
introducing a two stage strategy for optimization and 
variable selection of high-dimensional GPs. In the first 
stage, sequential likelihood ratio tests, with a couple of 
tuning parameters, are used to select the relevant di- 
mensions. This, however, requires the relevant dimen- 
sions to be axis-aligned with an ARD kernel. Chen et 
al provide empirical results only for synthetic exam- 
ples (of up to 400 dimensions), but they provide key 
theoretical guarantees. 

Many researchers have noted that for certain classes 
of problems most dimensions do not change the ob- 
jective function significantly; examples include hyper- 
parameter optimization for neural networks and deep 
belief networks (Bergstra & Bengio, 2012) and auto- 
matic configuration of state-of-the-art algorithms for 
solving A/'T^-hard problems (Hutter, 2009). That is to 
say these problems have ^^low effective dimensional- 
ity". To take advantage of this property, Bergstra & 
Bengio (2012) proposed to simply use random search 
for optimization - the rationale being that points 
sampled uniformly at random in each dimension can 
densely cover each low-dimensional subspace. As such, 
random search can exploit low effective dimensional- 
ity without knowing which dimensions are important. 
In this paper, we exploit the same property in a new 
Bayesian optimization variant based on random em- 
beddings. 

Figure 1 illustrates the idea behind random embed- 
dings in a nutshell. Assume we know that a given 
D = 2 dimensional black-box function /(xi,X2) only 
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Unimportant 



Figure 1. This function in D=2 dimesions only has d=l 
effective dimension: the vertical axis indicated with the 
word important on the right hand side figure. Hence, the 
1-dimensional embedding includes the 2-dimensional func- 
tion's optimizer. It is more efficient to search for the opti- 
mum along the 1-dimensional random embedding than in 
the original 2-dimensional space. 

has d = 1 important dimensions, but we do not 
know which of the two dimensions is the important 
one. We can then perform optimization in the em- 
bedded 1-dimensional subspace defined by xi = X2 
since this is guaranteed to include the optimum. As 
we demonstrate in this paper, using random embed- 
dings this simple idea largely scales to arbitrary D 
and allowing us to perform Bayesian optimiza- 
tion in a low-dimensional space to optimize a high- 
dimensional function with low intrinsic dimensional- 
ity. Importantly, this trick is not restricted to cases 
with axis-aligned intrinsic dimensions but applies to 
any d-dimensional linear subspace. 

Following an explanation of GP-based Bayesian opti- 
mization (Section 2), we introduce the Random EM- 
bedding Bayesian Optimization (REMBO) algorithm 
and state its theoretical properties, including regret 
bounds that only depend on the problem's intrinsic 
dimensionality (Section 3). Our experiments (Section 
4) show that REMBO can solve problems of previously 
untenable high extrinsic dimensions. They also show 
that REMBO can achieve state-of-the-art performance 
when automatically tuning the 47 discrete parameters 
of a popular mixed integer linear programming solver. 

2. Bayesian Optimization 

As mentioned in the introduction, Bayesian optimiza- 
tion has two ingredients that need to be specified: The 
prior and the acquisition function. In this work, we 
adopt GP priors. We review GPs very briefly and re- 
fer the interested reader to the book by Rasmussen & 
Williams (2006). A GP is a distribution over func- 
tions specified by its mean function m(-) and covari- 
ance A:(-, •). More specifically, given a set of points xi:t, 
with C R^, we have 

f (Xi:t) - A/'(m(xi:^), K(X1:^, Xi:^)), 



where K(xi:t, xi-^j^^j = /c(x^,Xj) serves as the covari- 
ance matrix. A common choice of k is the squared ex- 
ponential function (see Definition 4), but many other 
choices are possible depending on our degree of belief 
about the smoothness of the objective function. 

An advantage of using GPs lies in their analytical 
tractability. In particular, given observations xi:^ with 
corresponding values fi:t, where fi = f{^i)^ and a new 
point X*, the joint distribution is given by: 



fl:t' 

/* 



A/" m(xi:0. 



K(xi:^,Xi:^) 
k(x*,Xi:t) 



k(xi:^,X*)' 

/c(x*,x*) 



For simplicity, we assume that m(xi:^) = 0. Using the 
Sherman-Morrison- Woodbury formula, one can easily 
arrive at the posterior predictive distribution: 

/*|A,x*~AA(m(x*|A),<t(x*|A)), 

with data Vt = {^i-.t^h-.t}^ mean /i(x*|Pt) = 
k(x*,xi:^)K(xi:t,xi:t)"^fi:t and variance cr(x*|D^) = 

/C(X*,X*) -k(x*,Xi:,)K(Xi:,,Xi:,)"'k(Xi:,,X*). That 

is, we can compute the posterior predictive mean /i(-) 
and variance (t(-) exactly for any point x*. 

At each iteration of Bayesian optimization, one has 
to re-compute the predictive mean and variance. 
These two quantities are used to construct the sec- 
ond ingredient of Bayesian optimization: The ac- 
quisition function. In this work, we report results 
for the expected improvement acquisition function 
u{^\Vt) = E(max{0, /t+i(x) - /(x+)}|P,) (Mockus, 
1982; Vazquez & Beet, 2010; Buh, 2011). In this 
definition, x+ = argmax^^j^^ j /(x) is the element 
with the best objective value in the first t steps of 
the optimization process. The next query is: x^+i = 
argmax-j^^;i^ ii(x|P^). Note that this utility favors the 
selection of points with high variance (points in regions 
not well explored) and points with high mean value 
(points worth exploiting). We also experimented with 
the UCB acquisition function (Srinivas et al., 2010; 
de Freitas et al., 2012) and found it to yield simi- 
lar results. The optimization of the closed-form ac- 
quisition function can be carried out by off-the-shelf 
numerical optimization procedures, such as DIRECT 
(Jones et al., 1993) and CMA-ES (Hansen & Oster- 
meier, 2001). 

The Bayesian optimization procedure is shown in Al- 
gorithm 1. 

3. Random Embedding for Bayesian 
Optimization 

Before introducing our new algorithm and its theoret- 
ical properties, we need to define what we mean by 
effective dimensionality formally. 
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Algorithm 1 Bayesian Optimization 
1: for t = 1,2, ... do 

2: Find x^+i G by optimizing the acquisition 

function u: x^+i = arg max^^;^ ii(x|Pt). 
3: Augment the data P^+i = {Vt, (x^+i, /(x^+i))} 
4: end for 



Definition 1. A function f : ^ R i5 said to have 
effective dimensionality de, with de < D, if there 
exists a linear subspace T of dimension de such that 
for all xy G T C R^ and x^ G C R^, we have 
/(x) = /(xy + x^) = /(xy); whcrc denotes the 
orthogonal complement of T . We call T the effective 
subspace of f and the constant subspace. 

This definition simply states that the function does not 
change along the coordinates x^, and this is why we 
refer to as the constant subspace. Given this def- 
inition, the following theorem shows that problems of 
low effective dimensionality can be solved via random 
embedding. 

Theorem 2. Assume we are given a function f : 
R^ R with effective dimensionality de and a ran- 
dom matrix A G R^^^ with independent entries sam- 
pled according to A/'(0, 1) and d> de- Then, with prob- 
ability 1, for any x G R^; there exists a y G R^ such 
that f{^)=f {Ay). 

Proof. Please refer to the appendix. □ 

Theorem 2 says that given any x G R^ and a random 
matrix A G R^^*^, with probability 1, there is a point 
y G R^ such that /(x) = /(Ay). This implies that 
for any optimizer x"^ G R^, there is a point y"^ G R^ 
with /(x"^) = f{Ay^). Therefore, instead of optimiz- 
ing in the high dimensional space, we can optimize the 
function ^(y) = /(Ay) in the lower dimensional space. 
This observation gives rise to our new Random EM- 
bedding Bayesian Optimization (REMBO) algorithm 
(see Algorithm 2). REMBO first draws a random em- 
bedding (given by A) and then performs Bayesian op- 
timization in this embedded space. 

In practice, we do not typically perform optimization 
across all of R^, but rather across a compact sub- 
set X C (typically a box). When REMBO se- 
lects a point y such that Ay is outside the box A*, 
it projects Ay onto A' before evaluating /. That is, 
^(y) = f{px{Ay)), where p;, : R^ ^ R^ is the 
standard projection operator for our box-constraint: 
px{y) = argmin2^;^||z - y||2; see Figure 2. We stih 
need to describe how REMBO chooses the bounded 
region 3^ C R^, inside which it performs Bayesian op- 
timization. This is important because REMBO 's ef- 
fectiveness depends on the size of y. Locating the 



Algorithm 2 REMBO: Bayesian Optimization with 
Random Embedding 

1: Generate a random matrix A G R^^'^ 
2: Choose the bounded region set 3^ C R^ 
3: for t = 1,2, ... do 

4: Find yt+i G R^ by optimizing the acquisition 
function u: y^+i = arg maXy^^ ix(y |Pt). 

5: Augment the data = 
{A,(y,+i,/(Ay,+i)} 

6: Update the kernel hyper-parameters. 

7: end for 



A 




Figure 2. Embedding from d — 1 into D — 2. The box 
illustrates the 2D constrained space Af, while the thicker 
red line illustrates the ID constrained space 3^. Note that 
if Ay is outside Af, it is projected onto X. The set y 
must be chosen large enough so that the projection of its 
image, A3^, onto the effective subspace (vertical axis in this 
diagram) covers the vertical side of the box. 

optimum within y is easier if y is small, but if we 
set y too small it may not actually contain the global 
optimizer. In the following theorem, we show that we 
can choose 3^ in a way that only depends on the effec- 
tive dimensionality de such that the optimizer of the 
original problem is contained in the low dimensional 
space with constant probability. 

Theorem 3. Suppose we want to optimize a function 
f : R^ R with effective dimension de < d sub- 
ject to the box constraint X C R^^ where X is cen- 
tered around 0. Let us denote one of the optimizers 
by x"^. Suppose further that the effective subspace T 
of f is such that T is the span of de basis vectors. 
Let Xy G T n A' be an optimizer of f inside T. If A 
is a D X d random matrix with independent standard 
Gaussian entries, there exists an optimizer y'^ G R^ 
such that /(Ay*) = /(x^) and \\y*h < "^W^jh 
with probability at least 1 — e. 

Proof. Please refer to the appendix. □ 
Theorem 3 says that if the set X in the original space 



Bayesian Optimization in a Billion Dimensions 



is a box constraint, then there exists an optimizer 
Xy G A* that is dg-sparse such that with probability at 

least 1-e, ||y*||2 < ^||x^||2 where /(Ay*) = /(x* ). 
If the box constraint is X = [—1, 1]^ (which is always 
achievable through rescaling), we have with probabil- 
ity at least 1 — e that 



the corresponding squared exponential kernel as 



Hence, to choose 3^, we just have to make sure that the 
ball of radius de/e satisfies (0, ^) C J^. In most prac- 
tical scenarios, we found that the optimizer does not 
fall on the boundary which implies that ||xy||2 < de- 
Thus setting y too big may be unnecessarily wasteful; 
in all our experiments we set y to be [— a/S, y/df. 

3.1. Increasing the Success Rate of REMBO 

Theorem 3 only guarantees that y contains the op- 
timum with probability at least 1 — e; with proba- 
bility ^ < e the optimizer lies outside of y. There 
are several ways to guard against this problem. One 
is to simply run REMBO multiple times with differ- 
ent independently drawn random embeddings. Since 
the probability of failure with each embedding is J, 
the probability of the optimizer not being included in 
the considered space of k independently drawn embed- 
dings is 5^. Thus, the failure probability vanishes ex- 
ponentially quickly in the number of REMBO runs, k. 
Note also that these independent runs can be trivially 
parallelized to harness the power of modern multi-core 
machines and large compute clusters. 

Another way of increasing REMBO 's success rate is to 
increase the dimensionality d it uses internally. When 
d > de^ with probability 1 we have different em- 
beddings of dimensionality d^. That is, we only need 
to select de columns of A G R^^^ to represent the de 
relevant dimensions of x. We can do this by choosing 
de sub-components of the d-dimensional vector y while 
setting the remaining components to zero. Informally, 
since we have more embeddings, it is more likely that 
one of these will include the optimizer. In our exper- 
iments, we will assess the merits and shortcomings of 
these two strategies. 

3.2. Choice of Kernel 

Since REMBO uses GP-based Bayesian optimization 
to search in the region y C M'^, we need to define 
a kernel between two points y*^^\y*^^^ G y. We begin 
with the standard definition of the squared exponential 
kernel: 

Definition 4. Given a length scale £ > 0, we define 



exp 



/(2)||2 



2P 



It is possible to work with two variants of this kernel. 
First, we can use kf{y^^y'^) as in Definition 4. We 
refer to this kernel as the low-dimensional kernel. We 
can also adopt an implicitly defined high-dimensional 
kernel on X\ 



fef(y^'\y') = exp 



2P 



(2)^ 



where px - is the projection operator for 

our box-constraint as above (see Figure 2). 

Note that when using this high-dimensional kernel, we 
are fitting the GP in D dimensions. However, the 
search space is no longer the box A', but it is instead 
given by the much smaller subspace {px{Ay) : y G 
y}. Importantly, in practice it is easier to maximize 
the acquisition function in this subspace. 

Both kernel choices have strengths and weaknesses. 
The low-dimensional kernel has the benefit of only hav- 
ing to construct a GP in the space of intrinsic dimen- 
sionality whereas the high-dimensional kernel has to 
construct the GP in a space of extrinsic dimensional- 
ity D. However, the low-dimensional kernel may waste 
time exploring in the region of the embedding outside 
of X (see Figure 2) because two points far apart in 
this region may be projected via px nearby points 
on the boundary of X. The high-dimensional kernel 
is not affected by this problem because the search is 
conducted directly on {px{Ay) : y e y}. 

The choice of kernel also depends on whether our vari- 
ables are continuous, integer or categorical. The cat- 
egorical case is important because we often encounter 
optimization problems that contain discrete choices. 
We define our kernel for categorical variables as: 



(2)^ 



A 



)=exp(-^M^(Ay(i)),.(A/^^)) 



/2)> 



where y^^\y^'^'^ G y C and h defines the distance 
between 2 vectors. The function s maps continuous 
vectors to discrete vectors. In more detail, ^(x) first 
projects X to [—1,1]^ to generate x. For each di- 
mension Xi of X, 8 then map Xi to the correspond- 
ing discrete parameters by scaling and rounding. In 
our experiments, following Hutter (2009), we defined 
/i(x^-'-\ x^^^) = \{i : x[^^ 7^ so as not to impose 

an artificial ordering between the values of categori- 
cal parameters. In essence, we measure the distance 
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between two points in the low-dimensional space as 
the Hamming distance between their mappings in the 
high-dimensional space. 

3.3. Regret Bounds 

When using the high-dimensional kernel on 
{px{-^y) : y G 3^} C A', we could easily apply pre- 
vious theoretical results (Srinivas et al., 2010; Bull, 
2011; de Freitas et al., 2012). However, this is not sat- 
isfying since the rates of convergence would still de- 
pend on D. If the low-dimensional embedding cap- 
tures the optimizer, and since the search is conducted 
in {pxi-^y) y ^ y} instead of A', we should expect 
faster rates of convergence that only depend on the 
size of the embedding's dimensionality. The rest of 
this section shows that it is indeed possible to obtain 
rates of convergence that only depend on the embed- 
ding's dimensionality. 

We begin our mathematical treatment with the defini- 
tions of simple regret and the skew squared exponential 
(SSE) kernel. 

Definition 5. Given a function / : A' ^ R and a 
sequence of points {xt}^i C X, the simple regret on 
the set X at time T is defined to he rf{T) = sup;^^ / — 

max/(x^). 

Definition 6. Given a symmetric, positive- definite 
matrix A, we define the corresponding skew squared 
exponential kernel using the formula 

Given A, i (as in the squared exponential kernel 
k^) and X C R^, we denote the Reproducing Ker- 
nel Hilbert Spaces (RKHSs) corresponding to k^ and 
ki by Ha(^) and ^^(A'), respectively (Steinwart & 
Christmann, 2008, Definition 4.18). Moreover, given 
an arbitrary kernel /c, we will denote its RKHS hy Hk- 

Our main result below shows that the simple regret 
vanishes with rate 0{t~d^ with high probability when 
we use the squared exponential kernel. Note that we 
only make the assumption that the cost function re- 
stricted to T is governed by a skew symmetric ker- 
nel, a much weaker assumption than the standard as- 
sumption that the cost function is governed by an axis 
aligned kernel in D dimensions (see, e.g.. Bull, 2011). 

Theorem 7. Let X C R^ be a compact subset with 
non-empty interior that is convex and contains the ori- 
gin and f : X ^ a function with effective dimen- 
sion d. Suppose that the restriction of f to its effective 
subspace T, denoted f\q-, is an element of the RKHS 
T-La(^^) with A symmetric and positive definite and 



also satisfying < < Amin(A) < Amax(A) < for 
constants r and R. 

Let A be a D X d matrix, whose elements are drawn 
from the normal distribution ^A/'(0, 1). Given any 
e > 0, we can choose a length-scale £ = £{e) such that 
running Expected Improvement with kernel k^ on the 
restriction of f to the image of A inside X would have 
simple regret in 0{t~d^ with probability 1 — e. 

This theorem does not follow directly from the results 
of Bull (2011), since the kernel is not aligned with 
the axes, both in the high-dimensional space and the 
lower dimensional embedding. Thus, even given the 
true hyper- parameter the aforementioned paper will 
not entail a convergence result. 

Please refer to the appendix for the proof of this theo- 
rem. The general idea of the proof is as follows. If we 
have a squared exponential kernel /c^, with a smaller 
length scale than a given kernel /c, then an element / 
of the RKHS of k is also an element of the RKHS of 
k£. So, when running expected improvement, one can 
safely use ki instead of k as the kernel and still get a 
regret bound. Most of the proof is dedicated to find- 
ing a length scale i that fits "underneath" our kernel, 
so we can replace our kernel with /c^, to which we can 
apply the results of Buh (2011). 

The above theorem requires the embedded dimension 
and the effective dimension to coincide, but due to 
Proposition 1 in (de Freitas et al., 2012), we strongly 
believe that the analysis in (Bull, 2011) can be modi- 
fied to allow for situations in which some of the ARD 
parameters are zero, which is precisely what is pre- 
venting us from extending this result to the case where 
d > de. 

3.4. Hyper-parameter Optimization 

For Bayesian optimization (and therefore REMBO), 
it is difficult to manually estimate the true length 
scale hyper-parameter of a problem at hand. To avoid 
any manual steps and to achieve robust performance 
across diverse sets of objective functions, in this paper 
we adopted an adaptive hyper-parameter optimization 
scheme. The length scale of GPs is often set by max- 
imizing marginal likelihood (Rasmussen & Williams, 
2006; Jones et al., 1998). However, as demonstrated by 
Bull (2011), this approach, when implemented naively, 
may not guarantee convergence. Here, we propose to 
optimize the length scale parameter £ by maximizing 
the marginal likelihood subject to an upper bound U 
which is decreased when the algorithm starts exploit- 
ing too much. Full details are given in Algorithm 3. 
We say that the algorithm is exploiting when the stan- 
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Algorithm 3 Bayesian Optimization with Hyper- 
parameter Optimization, 
input Threshold t^r- 

input Upper and lower bounds U > L > for hyper- 
parameter. 

input Initial length scale hyper-parameter £ G [L^U]. 
1: Initialize C = 
2: for t = 1,2, ... do 

3: Find x^+i by optimizing the acquisition function 

4: if V'cr(x^+i) < then 

5: C = C+1 

6: else 

7: C = 

8: end if 

9: Augment the data P^+i = {Vt, (x^+i, /(x^+i))} 
10: if i mod 20 = or C = 5 then 
11: if C = 5 then 
12: U = max{0.9^, L} 

13: C = 

14: end if 

15: Learning the hyper-parameter by optimizing 
the log marginal likelihood by using DIRECT: 
£ = arg max^^[^^^] logp(fi:t+i |xi:t+i , /) 

16: end if 

17: end for 



dard deviation at the maximizer of the acquisition 
function ^/a{x^^ is less than some threshold for 
5 consecutive iterations. Intuitively, this means that 
the algorithm did not emphasize exploration (search- 
ing in new parts of the space, where the predictive un- 
certainty is high) for 5 consecutive iterations. When 
this criterion is met, the algorithm decreases its upper 
bound U multiplicatively and re-optimizes the hyper- 
parameter subject to the new bound. Even when 
the criterion is not met the hyper-parameter is re- 
optimized every 20 iterations. 

The motivation of this algorithm is to rather err on 
the side of having too small a length scale. ^ Given a 
squared exponential kernel /c^, with a smaller length 
scale than another kernel /c, one can show that any 
function / in the RKHS characterized by k is also an 
element of the RKHS characterized by k£. So, when 
running expected improvement, one can safely use k£ 
instead of k as the kernel of the GP and still preserve 
convergence (Bull, 2011). We argue that (with a small 
enough lower bound L) the algorithm would eventually 
reduce the upper bound enough to allow convergence. 
Also, the algorithm would not explore indefinitely as 
L is required to be positive. In our experiments, we 



set the initial constraint [L, U] to be [0.01,50] and set 
= 0.002. 

4. Experiments 

We now study REMBO empirically. We first use 
synthetic functions of small intrinsic dimensionality 
de = 2 but extrinsic dimension D up to 1 billion to 
demonstrate REMBO's independence of D. Then, we 
apply REMBO to automatically optimize the 47 pa- 
rameters of a widely-used mixed integer linear pro- 
gramming solver and demonstrate that it achieves 
state-of-the-art performance. However, we also warn 
against the blind application of REMBO. To illustrate 
this, in the appendix we study REMBO's performance 
for tuning the 14 parameters of a random forest body 
part classifier used by Kinect. In this application, all 
the D = 14: parameters appear to be important, and 
while REMBO (based on d = 3) finds reasonable solu- 
tions (better than random search and comparable to 
what domain experts achieve), standard Bayesian op- 
timization can outperform REMBO (and the domain 
experts) in such moderate-dimensional spaces. 

4.1. Experimental Setup 

For all our experiments, we used a single robust ver- 
sion of REMBO that automatically sets its GP's length 
scale parameter as described in Section 3.4. For each 
optimization of the acquisition function, this version 
runs both DIRECT (Jones et al., 1993) and CMA- 
ES (Hansen & Ostermeier, 2001) and uses the result 
of the better of the two. The code for REMBO, as 
well as all data used in our experiments will be made 
publicly available in the near future. 

Some of our experiments required substantial com- 
putational resources, with the computational expense 
of each experiment depending mostly on the cost of 
evaluating the respective blackbox function. While 
the synthetic experiments in Section 4.2 only required 
minutes for each run of each method, optimizing the 
mixed integer programming solver in Section 4.3 re- 
quired 4-5 hours per run, and optimizing the random 
forest classifier in Appendix D required 4-5 days per 
run. In total, we used over half a year of CPU time 
for the experiments in this paper. 

In each experiment, we study the effect of our two 
methods for increasing REMBO's success rate (see 
Section 3.1) by running different numbers of indepen- 
dent REMBO runs with different settings of its inter- 
nal dimensionality d. 



similar idea is exploited in the proof of Theorem 7. 



Bayesian Optimization in a Billion Dimensions 



10 

5 

4 

2 

1 



d -- 



d - 



0.0022 ± 0.0035 
0.0004 ± 0.0011 
0.0001 ± 0.0003 
0.1514 ± 0.9154 
0.7406 ± 1.8996 



0.1553 ± 0.1601 
0.0908 ± 0.1252 
0.0654 ± 0.0877 
0.0309 ± 0.0687 
0.0143 ± 0.0406 



0.4865 ± 0.4769 
0.2586 ± 0.3702 
0.3379 ± 0.3170 
0.1643 ± 0.1877 
0.1137 ± 0.1202 



Table 1. Optimality gap for de — 2- dimensional Branin 
function embedded in D = 25 dimensions, for REMBO 
variants using a total of 500 function evaluations. The 
variants differed in the internal dimensionality d and in 
the number of interleaved runs k (each such run was only 
allowed 500/ A; function evaluations). We show mean and 
standard deviations of the optimality gap achieved after 
500 function evaluations. 



4.2. Bayesian Optimization in a Billion 
Dimensions 

In this section, we add empirical evidence to our the- 
oretical finding from Section 3 that REMBO 's perfor- 
mance is independent of the extrinsic dimensionality 
D when using the low-dimensional kernel k^{y^^y'^) 
from Definition 4. Specifically, using synthetic data, 
we show that when using that kernel REMBO has 
no problem scaling up to as many as 1 billion dimen- 
sions. We also study REMBO 's invariance properties 
and empirically evaluate our two strategies for increas- 
ing its success probability. 

The experiments in this section employ a standard 
de = 2-dimensional benchmark function for Bayesian 
optimization, embedded in a D-dimensional space. 
That is, we add D — 2 additional dimensions which do 
not affect the function at all. More precisely, the func- 
tion whose optimum we seek is /(xi:i)) = b{xi^Xj)^ 
where b is the Branin function (see Lizotte, 2008, 
for its exact formula), and where dimensions i and 
j are selected once using a random permutation of 
To measure the performance of each op- 
timization method, we used the optimality gap: the 
difference of the best function value it found and the 
optimal function value. 

We first study the effectiveness of the two techniques 
for increasing REMBO's success probability that we 
proposed in Section 3.1. To empirically study the ef- 
fectiveness of using internal dimensionalities d > de-, 
and of interleaving multiple independent runs, /c, we 
ran REMBO with all combinations of three different 
values of d and k. The results in Table 1 demonstrate 
that both techniques helped improve REMBO's per- 
formance, with interleaved runs being the more effec- 
tive strategy. We note that in 13/50 REMBO runs, the 
global optimum was indeed not contained in the box 3^ 
that REMBO searched with d = 2; this is the reason 
for the poor mean performance of REMBO with d = 2 



and k = 1. However, the remaining 37 runs performed 
very well, and REMBO thus performed well when us- 
ing multiple interleaved runs: with a failure rate of 
13/50=0.26 per independent run, the failure rate us- 
ing /c = 4 interleaved runs is only 0.26"^ ~ 0.005. One 
could easily achieve an arbitrarily small failure rate by 
using many independent parallel runs. Here we evalu- 
ated all REMBO versions using a fixed budget of 500 
function evaluations that is spread across multiple in- 
terleaved runs — for example, when using k = A inter- 
leaved REMBO runs, each of them was only allowed 
125 function evaluations. The results show that per- 
forming multiple independent runs nevertheless sub- 
stantially improved REMBO's performance. Using a 
larger d was also effective in increasing the probabil- 
ity of the optimizer falling into REMBO's box y but 
at the same time slowed down REMBO's convergence 
(such that interleaving several short runs lost its effec- 
tiveness). We conclude that using several interleaved 
runs of REMBO with small d > de performs best. 

Next, we compared REMBO to standard Bayesian op- 
timization (BO) and to random search, for an extrin- 
sic dimensionality of D = 25. Standard BO is well 
known to perform well in low dimensions, but to de- 
grade above a tipping point of about 15-20 dimensions. 
Our results for D = 25 (see Figure 3, left) confirm 
that BO performed rather poorly just above this criti- 
cal dimensionality (merely tying with random search). 
REMBO, on the other hand, still performed very well 
in 25 dimensions. 

Since REMBO is independent of the extrinsic dimen- 
sionality D as long as the intrinsic dimensionality de 
is small, it performed just as well in D = 1 000 000 000 
dimensions (see Figure 3, middle). To the best of our 
knowledge, the only other existing method that can be 
run in such high dimensionality is random search. 

Finally, one important advantage of REMBO is that 
— in contrast to the approach of Chen et al. (2012) — 
it does not require the effective dimension to be coordi- 
nate aligned. To demonstrate this fact empirically, we 
rotated the embedded Branin function by an orthogo- 
nal rotation matrix R G R^^^. That is, we replaced 
/(x) by /(Rx). Figure 3 (right) shows that REMBO's 
performance is not affected by this rotation. 

4.3. Automatic Configuration of a Mixed 
Integer Linear Programming Solver 

State-of-the-art algorithms for solving hard computa- 
tional problems tend to parameterize several design 
choices in order to allow a customization of the al- 
gorithm to new problem domains. Automated meth- 
ods for algorithm configuration have recently demon- 
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Figure 3. Comparison of random search (RANDOM), standard Bayesian optimization (BO), and REMBO. Left: D — 25 
extrinsic dimensions; Middle: D = 1000 000 000 extrinsic dimensions; Right: D = 25, with a rotated objective function. 
For each method, we plot means and 1/4 standard deviation confidence intervals of the optimality gap across 50 trials. 



strated that substantial performance gains of state- 
of-the-art algorithms can be achieved in a fully au- 
tomated fashion (Mockus et al., 1999; Hutter et al., 
2010; Bergstra et al., 2011; Wang & de Freitas, 2011). 
These successes have led to a paradigm shift in algo- 
rithm development towards the active design of highly 
parameterized frameworks that can be automatically 
customized to particular problem domains using opti- 
mization (Hoos, 2012; Bergstra et al., 2012). 

It has long been suspected that the resulting algorithm 
configuration problems have low dimensionality (Hut- 
ter, 2009). Here, we demonstrate that REMBO can 
exploit this low dimensionality even in the discrete 
spaces typically encountered in algorithm configura- 
tion. We use a configuration problem obtained from 
Hutter et al. (2010), aiming to configure the 40 bi- 
nary and 7 categorical parameters of Ipsolve^, a pop- 
ular mixed integer programming (MIP) solver that 
has been downloaded over 40 000 times in the last 
year. The objective is to minimize the optimality gap 
Ipsolve can obtain in a time limit of five seconds for 
a MIP encoding of a wildlife corridor problem from 
computational sustainability (Gomes et al., 2008). Al- 
gorithm configuration usually aims to improve perfor- 
mance for a representative set of problem instances, 
and effective methods need to solve two orthogonal 
problems: searching the parameter space effectively 
and deciding how many instances to use in each eval- 
uation (to trade off computational overhead and over- 
fitting) . Our contribution is for the first of these prob- 
lems; to focus on how effectively the different methods 
search the parameter space, we only consider configu- 
ration on a single problem instance. 

Due to the discrete nature of this optimization 
problem, we could only apply REMBO using 
the high-dimensional kernel for categorical variables 

^ htt p : / / Ipsolve . sour ceforge . net / 



(y^^\y^^^) described in Section 3.2. While we have 
not proven any theoretical guarantees for discrete op- 
timization problems, REMBO appears to effectively 
exploit the low effective dimensionality of at least this 
particular optimization problem. 

As baselines for judging our performance in config- 
uring Ipsolve, we used the configuration procedures 
ParamlLS (Hutter et al., 2009) and SMAC (Hutter 
et al., 2011). ParamlLS and SMAC have been specifi- 
cally designed for the configuration of algorithms with 
many discrete parameters and yield state-of-the-art 
performance for this task. 

As Figure 4.3 (top) shows, ParamlLS and SMAC in- 
deed outperformed random search and BO. However, 
remarkably, our vanilla REMBO method performed 
even slightly better. While the figure only shows 
REMBO with d = 5 to avoid clutter, we by no means 
optimized this parameter; the only other value we tried 
was d = 3, which resulted in indistinguishable perfor- 
mance. 

As in the synthetic experiment, REMBO 's perfor- 
mance could be further improved by using multiple 
interleaved runs. However, as shown by Hutter et al. 
(2012), multiple independent runs can also improve 
the performance of SMAC and especially ParamlLS. 
Thus, to be fair, we re-evaluated all approaches using 
interleaved runs. Figure 4.3 (bottom) shows that when 
using k = 4 interleaved runs of 500 evaluations each, 
REMBO and ParamlLS performed best, with a slight 
advantage for REMBO early on in the search. 

5. Conclusion 

We have demonstrated that it is possible to use ran- 
dom embeddings in Bayesian optimization to optimize 
functions of extremely high extrinsic dimensionality 
D provided that they have low intrinsic dimension- 
ality de. Our resulting REMBO algorithm is coor- 
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Figure 4. Performance of various methods for configuration 
of Ipsolve; we show the optimality gap Ipsolve achieved 
with the configurations found by the various methods 
(lower is better). Top: a single run of each method; Bot- 
tom: performance with k — A interleaved runs. 

dinate independent and has provable regret bounds 
that are independent of the extrinsic dimensionality 
D. Moreover, it only requires a simple modification of 
the original Bayesian optimization algorithm; namely 
multiplication by a random matrix. We confirmed 
REMBO's independence of D empirically by optimiz- 
ing low-dimensional functions embedded in previously 
untenable extrinsic dimensionalities of up to 1 billion. 
Finally, we demonstrated that REMBO achieves state- 
of-the-art performance for optimizing the 47 discrete 
parameters of a popular mixed integer programming 
solver, thereby providing further evidence for the ob- 
servation (already put forward by Bergstra, Hutter 
and colleagues) that, for many problems of great prac- 
tical interest, the number of important dimensions in- 
deed appears to be much lower than their extrinsic 
dimensionality. 
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A. Proof of Theorem 2 

Proof. Since / has effective dimensionality dg, there exists an effective subspace T C M^, such that rank(T) = de. 
Furthermore, any x G decomposes as x = xy+x^, where xy G T and x^ G T^. Hence, /(x) = /(xy+xx) = 
/(xy). Therefore, without loss of generality, it will suffice to show that for all xy G T, there exists a y G 
such that /(xy) = /(Ay). 

Let $ G R^><^e matrix, whose columns form an orthonormal basis for T. Hence, for each xy G T, there 
exists a c G M'^'^ such that xy = $c. Let us for now assume that has rank de. If has rank (ig, there 



exists a y such that ($ A)y 



c. The orthogonal projection of Ay onto T is given by 

$C = Xy. 



$$^Ay 



Thus Ay = Xy +x' for some x' G since xy is the projection Ay onto T. Consequently, /(Ay) = /(xy +x') = 
/(xt). 

It remains to show that, with probability one, the matrix ^"^A has rank de- Let A^ G M^^^^e submatrix 
of A consisting of any de columns of A, which are i.i.d. samples distributed according to A/'(0,I). Then, ^"^a^ 
are i.i.d. samples from A/'(0, $^$) = JV{Od^^Id^xde)i so we have ^^Ag, when considered as an element 
of M^e^ is a sample from A/'(0c^2, 1^2x^2). On the other hand, the set of singular matrices in R'^e has Lebesgue 
measure zero, since it is the zero set of a polynomial (i.e. the determinant function) and polynomial functions are 
Lebesgue measurable. Moreover, the Normal distribution is absolutely continuous with respect to the Lebesgue 
measure, so our matrix ^"^Ag is almost surely non-singular, which means that it has rank d^ and so the same 
is true of ^"^A, whose columns contain the columns of $"^Ag. □ 



B. Proof of Theorem 3 

Proof. Since A' is a box constraint, by projecting x"^ to T we get Xy G T fl A*. Also, since x* = Xy + x^ for 
some x^ G T^, we have /(x"^) = /(xy). Hence, Xy is an optimizer. By using the same argument as appeared in 
Proposition 1, it is easy to see that with probability 1 Vx G T 3y G such that Ay = x + x^ where x^ G T"^. 
Let ^ be the matrix whose columns form a standard basis for T. Without loss of generality, we can assume that 

^ 

Then, as shown in Proposition 2, there exists a y"^ G M'^ such that Ay^ = Xy. Note that for each column 
of A, we have 



^^^a^ - A/" 0, 



0^ 

Therefore ^^"^Ay"*" = Xy is equivalent to By"^ = Xy where B G R^eXde ^ random matrix with independent 
standard Gaussian entries and Xy is the vector that contains the first d^ entries of Xy (the rest are O's). By 
Theorem 3.4 of (Sankar et al., 2003), we have 



IB 



> 



/dp 



< e. 



Thus, with probability at least 1 - e, ||y*|| < ||B-^ II2 ||x^ II2 



|2||X^||2 



< 



|Xy||2. 



□ 



C. Proof of Theorem 7 

Before embarking on the proof of Theorem 7, we introduce some definitions and state a few preliminary results, 
which we quote from (Steinwart & Christmann, 2008) and (Bull, 2011) to facilitate the reading of this exposition. 

Definition 8. Given a map ir : S ^ T between any two sets S and T, and any map f : T x - ■ - x T ^ with 

^ V ^ 

n-times 

n> 1, we define the pull-back of f under tt as follows: 

7r*/(5i, . . . , Sn) := /(ttsi, . . . , yr^n). 
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That is one evaluates the pull-back tt*/ on points in S by first '^pushing them forward^' onto T and then using f 
to get a number. 

Also, if the map tt is given by a matrix A, we will use the notation A*/ for the pull-back of f under the linear 
map induced by A. Moreover, given a matrix A and a set S in its target space, we will denote by A~^{S) the 
set of all points that are mapped into S by A. 

Lemma 9 (Lemma 4.6 in (Steinwart & Christmann, 2008)). Let ki be a kernel on X\ and k2 be a kernel on X2. 
Then k\ ■ k2 is a kernel on X\ x X2. 

Remark 10. In the proof of the above lemma in (Steinwart & Christmann, 2008), it is argued that if Hi is the 
RKHS of ki for i = 1^2, then Hi is the RKHS of ki ■ k2, where is the tensor product of Hilbert spaces 
and the elements have the form h{xi,X2) = ^i fi{xi)gi{x2) , where the functions : A'l ^ R are 

elements of Hi and the functions : ^ M are elements ofl-L2' 

Corollary 11 (Corollary 4.43 in (Steinwart & Christmann, 2008)). Given X '^W^ with non-empty interior, 
and r > {), then we have an isomorphism of Hilbert spaces l-Lr{X) = Hr(M^); through an extension operator Ix 
(whose definition is omitted, since it is not needed here). 

Proposition 12 (The second part of Proposition 4.46 in (Steinwart & Christmann, 2008)). Given < i < U , 
for all non-empty X CW^^ the RKHS l-Lu{X) can be mapped under the identity map into l-Li{X) and we have 
the following bound: 

d 

'IP 

t 

where the norm in the equation is the operator norm, i.e. supj^^^ • 

Remark 13. Note that here, the map id signifies the following: the element f G Hu corresponds to a real-valued 
function on X , which we will also denote by f , so one can pose the question whether or not this function is an 
element of Hi as well, and the existence of the map id : Hu Hi implies that f is indeed an element of Hi. 
Equivalently, Hu C and 

d 

\\f\\n,<(jy WfWnu- 

In the proof of our theorem below, we will extend this result for squared exponential kernels to skewed squared 
exponential kernels. 

Proposition 14 (Theorem 2 in (Bull, 2011), paraphrased for our particular setting). Given a squared exponential 
kernel ki on a compact subset y C and a function f G 1-Li{y), then applying Expected Improvement to f 
results in simple regret that diminishes according to 0{t~dy with the constants worsening as the norm \\f\\^^(^y^ 
increases. 



Wid'.Hu^neW < 



Proof of Theorem 7. Let U : X ^ T denote the (unknown) orthogonal projection onto the effective subspace 
of /; we will also denote the corresponding matrix by U. (Please refer to the right hand side of Figure 1 for 
an illustration of a 2-dimensional space, a 1-dimensional embedding (slanted line) and a 1-dimensional effective 
space (vertical axis).) 

Recall from the theorem statement that /|r is assumed to be an element of the RKHS Ha, and that we have 
/ = n*/|7-, i.e. / is obtained from "stretching /I7- open" along the orthogonal subspace of T. From this, we 
can conclude that / is an element of H^d, with k^ := n*/cA, where /ca is the kernel on the effective subspace T. 

Now, given the embedding ^ defined by the matrix A, the pull-back function A*/ is an element of the 
RKHS Ha*/cO- Henceforth, we will use the notation 

k"^ := A*k^ = A*n*i^A = (nA)*i^A. 



In the remainder of this proof, we replace k^ with a squared exponential kernel k£ that is "thinner" than k^ and 
so A*/ is also an element of the RKHS of k^. By showing that this is true, REMBO (which uses ki) has enough 
approximation power. Moreover, the statement of Proposition 14 applies. 



Bayesian Optimization in a Billion Dimensions 



Since is constant along T^, we get that is in turn constant along {^^)^ since A is randomly 
chosen, the linear subspace (^^) will almost surely be zero dimensional. Let us introduce the notation 11 
for the d X D matrix that projects vectors in onto the effective subspace T. 

We almost surely have that the dxd matrix 11 A is non-singular, since the space of singular matrices has measure 
0, given the fact that it is the zero-set of a polynomial, namely the determinant. Therefore, the kernel /c^, which 
has the form 

is also SSE, since A^XI^A^^IIA is symmetric and positive definite, simply because of the symmetry and positive 
definiteness of A~^: given y ^ 0, we have 

y^A^n^A-^nAy = y^A-^y > 0, 

where y := IIAy ^ since IIA is invertible. In what follows, we will use the notation A^^ := (A^XI^A^^IIA) ^ . 

Since 11 is an orthogonal projection matrix, it has an SVD decomposition 11 = USV consisting of an orthogonal 
dxd matrix U, an orthogonal D x D matrix V and a d x D matrix S that has the following form: 



S = 



1 ••• ••• 
1 ••• ••• 

••• 10 ••• 



. Now, given a fixed orthogonal matrix O and a random Gaussian vector v ~ A/'(0, Idxd)^ due to the rotational 
symmetry of the normal distribution, the vector Ov is also a sample from A/'(0,Idxd)- Therefore, given O as 
above, and a random Gaussian matrix F, then OF is also a random Gaussian matrix with the same distribution 
of entries. Moreover, given S as above, SF is a d x D random Gaussian matrix, since multiplying any matrix by 
S on the left simply extracts the first d rows of the matrix. 

Given this, if we fix an orthogonal decomposition A~^ = P^D~^P, where P is orthogonal and D is a diagonal 
matrix with the eigenvalues of A along the diagonal, we can conclude that 

G := PnA = PUSVA 

is a random Gaussian matrix, and so the matrix A^^ can be decomposed into random Gaussian and diagonal 
matrices as follows: 

A-^ = G^D-^G. 

Let 

5min Smax dcuotc the Smallest and the largest singular values of a matrix. With this notation in hand, 
we point out the following two facts about concentration of singular values: 

L Since for any pair of matrices A and B, we have Smax(AB) < Smax(A)5max(B), we get 

~ AmaxCA^"*") < 5max(G)^5max(D ^) < ^ — ^— 



(A.) 

and since G is a random matrix with Gaussian entries, we have (cf. Equation 2.3 in (Rudelson & Vershynin, 
2010)) 

P (w(G) < 2V^ + < 1 - 2e-^'/^ 
and so with probability 1 — |, we have 



W(G) <2v^+^/21nl 
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Therefore, with probabihty 1 — |, we have 



Amin(A,) > I ^= 1 . (1) 

2V^+ W21n- 



Henceforth, we wih use the notation 

^ = ^{e) := \ : 

II. On the other hand, we have 



Amax(Arf) ' ^ ' 

together with the fohowing probabihstic bound on Smin(G) (cf. Equation 3.2 in (Rudelson & Vershynin, 
2010)): 

P (^5min(G) > -^^ >l-8. 

So, with probabihty 1 — |, we have 

(G)> 



2V^' 

and so 



holds with probability 1 — f • 

In what follows, we will use the notation: 



Amax(Arf) < (2) 



f/ = f/(e):=^ 



Now, with these estimates in hand, we can go ahead and show that the following bound holds with probability 
1-e: 

l|A*/lk.(A-w)<(^)'ll/lrlk.(r) (3) 
This claim follows from the following sequence of facts: 

A. Since the transformation IIA is invertible, we have that the map (IIA)* : 1-L\{W^) 1-La^(W^) (recall that 

= ^(nA)*A) that sends g G Ha to (IIA)*^ is an isomorphism of Hilbert spaces and so 

||/|r|k,(R.) = ||A*/lk.,(R^) (4) 
since we have A*/ = A* (n*/|r) = (nA)*/|r. 

B. If we denote the eigenvalues of A^ by Ai < • • • < A^^, then, first of all, by Corollary 11, we have 

and also by Remark 10, we have the isomorphisms 
where (g) denotes the tensor product of Hilbert spaces. 
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Figure 5. Left: ground truth depth, ground truth body parts and predicted body parts; Right: features specified by offsets 
u and V. 



C. Using Equations (1) and (2) together with Proposition 12, we can conclude that 



l|A*/lk.( 



< 



< 



i 



-Mai(R)®---i8)Wa^ 



U 



U 



7 1 l|A*/lkA^(R^), 



(5) 



where the two inequalities are true with probability 1 — f each, and so they both hold with probability 1 — e. 

Composing the Inequality (5) with Equality (4) gives us the bound claimed in Inequality (3). 

Now that we know that the ^^(M^) norm of A*/ is finite, we can apply the Expected Improvement algorithm 
to it on the set A~^(A') with kernel /c^, instead of the unknown kernel /ca^, and then Proposition 14 tells us that 
the simple regret would be in □ 



D. Automatic Configuration of Random Forest Kinect Body Part Classifier 

We now present an additional experiment evaluating REMBO's performance for optimizing the 14 parameters of 
a random forest body part classifier. This classifier closely follows the proprietary system used in the Microsoft 
Kinect (Shotton et al., 2011) and will be publicly released in the near future. 

We begin by describing some details of the dataset and classifier in order to build intuition for the objective 
function and the parameters being optimized. The data we used consists of pairs of depth images and ground 
truth body part labels. Specifically, we used 1 500 pairs of 320x240 resolution depth and body part images, each 
of which was synthesized from a random pose of the CMU mocap dataset. Depth, ground truth body parts and 
predicted body parts (as predicted by the classifier described below) are visualized for one pose in Figure 5 (left). 
There are 19 body parts plus one background class. For each of these 20 possible labels, the training data 
contained 25 000 pixels, randomly selected from 500 training images. Both validation and test data contained 
all pixels in the 500 validation and test images, respectively. 

The random forest classifier is applied to one pixel P at a time. At each node of each of its decision trees, it 
computes the depth difference between two pixels described by offsets from P and compares this to a threshold. 
At training time, many possible pairs of offsets are generated at random, and the pair yielding highest information 
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gain for the training data points is selected. Figure 5 (right) visualizes a potential feature for the pixel in the 
green box: it computes the depth difference between the pixels in the red box and the white box, specified by 
respective offsets u and v. At training time, u and v are drawn from two independent 2-dimensional Gaussian 
distributions, each of which is parameterized by its two mean parameters /ii and fi2 and three covariance terms 
^11, ^12, and 1122 (^21 = ^12 because of symmetry). These constitute 10 of the parameters that need to be 
optimized, with range [-50,50] for the mean components and [1, 200] for the covariance terms. Low covariance 
terms yield local features, while high terms yield global features. Next to these ten parameters, the random 
forest classifier has four other standard parameters, outlined in Table 2. It is well known in computer vision that 
many of the parameters described here are important. Much research has been devoted to identifying their best 
values, but results are dataset specific, without definitive general answers. 

Table 2. Parameter ranges for random forest classifier. For the purpose of optimization, the maximum tree depth and the 
number of potential offsets were transformed to log space. 

Parameter Range 

Max. tree depth [1 60] 

Min. No. samples for non leaf nodes [1 100] 

No. potential offsets to evaluate [1 5000] 

Bootstrap for per tree sampling [T F] 

The objective in optimizing these RF classifier parameters is to find a parameter setting that learns the best 
classifier in a given time budget of five minutes. To enable competitive performance in this short amount of time, 
at each node of the tree only a random subset of data points is considered. Also note that the above parameters 
do not include the number of trees T in the random forest; since performance improves monotonically in T, we 
created as many trees as possible in the time budget. Trees are constructed depth first and returned in their 
current state when the time budget is exceeded. Using a fixed budget results in a subtle optimization problem 
because of the complex interactions between the various parameters (maximum depth, number of potential 
offsets, number of trees and accuracy). 

It is unclear a priori whether a low-dimensional subspace of these 14 interacting parameters exists that captures 
the classification accuracy of the resulting random forests. We performed large-scale computational experiments 
with REMBO, random search, and standard Bayesian optimization (BO) to study this question. In this experi- 
ment, we used the high-dimensional kernel for REMBO to avoid the potential over-exploration problems of the 
low-dimensional kernel described in Section 3.2. We believed that D = 14 dimensions would be small enough to 
avoid inefficiencies in fitting the GP in D dimensions. This belief was confirmed by the observation that standard 
BO (which operates in D = 14 dimensions) performed well for this problem. 

Figure 6 (left) shows the results that can be obtained by a single run of random search, BO, and REMBO. 
Remarkably, REMBO clearly outperformed random search, even based on as few as d = 3 dimensions.^ However, 
since the extrinsic dimensionality was "only" a moderate D = 14, standard Bayesian optimization performed 
well, and since it was not limited to a low-dimensional subspace it outperformed REMBO. Nevertheless, several 
REMBO runs actually performed very well, comparably with the best runs of BO. Consequently, when running 
/c = 4 interleaved runs of each method, REMBO performed almost as well as BO, matching its performance up 
to about 450 function evaluations (see Figure 6, right). 

We conclude that the parameter space of this RF classifier does not appear to have a clear low effective dimen- 
sionality; since the extrinsic dimensionality is only moderate, this leads REMBO to perform somewhat worse 
than standard Bayesian optimization, but it is still possible to achieve reasonable performance based on as little 
as d = 3 dimensions. 

This experiment also shows that automatic configuration techniques can reveal scientific facts about the problem; 
for example how to choose the depth of trees in RFs. For this reason, we feel it is important to advance to the 
machine learning community the following message about methodology. For any specific dataset, if researchers 

^Due to the very large computational expense of this experiment (in total over half a year of CPU time), we only 
performed conclusive experiments with d = 3; preliminary runs of REMBO with d = 4 performed somewhat worse than 
those with d = 3 for a budget of 200 function evaluations, but were still improving at that point. 
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Figure 6. Performance of various methods for optimizing RF parameters for body part classification. For all methods, 
we show RF accuracy (mean ±1/4 standard deviation across 10 runs) for all 2.2 million non background pixels in the 
500-pose validation set, using the RF parameters identified by the method. The results on the test set were within 1% 
of the results on the validation set. Left: performance with a single run of each method; Right: performance with A; = 4 
interleaved runs. 



were to release the obtained objective function evaluations, other researchers could use these values to expedite 
their experiments and gain greater knowledge about the problem domain. For example, our experiments with 
RFs took many days with powerful clusters of computers. By releasing not only the code but the samples of 
the objective function, other researchers could build on this data and ultimately learn a model for the objective 
function in this domain and, therefore, understand how the random forests parameters and design choices interact 
and affect performance. Every experiment should count. 



